home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Tech Arsenal 1
/
Tech Arsenal (Arsenal Computer).ISO
/
tek-02
/
msortp.zip
/
MSORTP.DOC
< prev
next >
Wrap
Text File
|
1993-03-02
|
25KB
|
572 lines
MSORTP - Merge Sort Unit for Protected and Real Mode
Kim Kokkonen
TurboPower Software
January 1993
Introduction
---------------------------------------------------------------------
MSORTP is a unit for sorting an unlimited number of items in real or
protected mode applications. It can be used with TPW 1.0, TPW 1.5, or
BP7 for DOS real mode, DOS protected mode, or Windows targets. Because
it takes advantage of PChar strings, MSORTP cannot be used with
earlier Turbo Pascal compilers.
Although MSORTP isn't source-compatible with Borland's old LSORT unit
from the Database Toolbox, its interface is quite similar. Hence, it
is a good candidate to replace LSORT in programs being upgraded to new
platforms.
MSORTP takes advantage of all available heap space, including the
extended memory heap when running in protected mode. When it runs out
of heap space, it uses a merge sort algorithm that stores temporary
intermediate files on disk.
MSORTP is an excerpt from TurboPower Software's commercial product
B-Tree Filer.
Using MSORTP
---------------------------------------------------------------------
The following small program shows how to implement a text file sort
filter using MSORTP.
{$R-,S-,X+,I+}
program TextSort;
{-Protected mode text file sort filter}
uses
Strings, MSortP;
var
LineBuf : array[0..1024] of Char;
MaxHeap : LongInt;
MinHeap : LongInt;
Status : Word;
procedure SendToSortEngine; far;
var
P : PChar;
begin
while not Eof do begin
{Read the next line from the input file as a PChar}
ReadLn(LineBuf);
{Don't sort empty lines to avoid StrNew quirk}
if StrLen(LineBuf) <> 0 then begin
{Copy string to heap}
P := StrNew(LineBuf);
{Store the pointer in the sort engine}
if not PutElement(P) then
Exit;
end;
end;
end;
function Less(var X, Y) : Boolean; far;
begin
Less := (StrComp(PChar(X), PChar(Y)) < 0);
end;
procedure GetFromSortEngine; far;
var
P : PChar;
begin
{Return each pointer and write the associated string to output}
while GetElement(P) do
WriteLn(P);
end;
begin
{Determine how much heap space MergeSort can use}
MaxHeap := MemAvail div 10;
MinHeap := 4*MinimumHeapToUse(SizeOf(PChar));
if MaxHeap < MinHeap then
MaxHeap := MinHeap;
{Perform the sort}
Status := MergeSort(MaxHeap, SizeOf(PChar),
SendToSortEngine, Less, GetFromSortEngine,
DefaultMergeName);
if Status <> 0 then
WriteLn('Sort failed, Status ', Status);
end.
This program provides three routines for the use of MSORTP.
SendToSortEngine reads lines of text from the standard input device,
allocates space for each line on the heap, and passes the string
pointer to MSORTP by calling PutElement. Note that only the pointer is
passed to MSORTP, since there is no need for MSORTP to keep a second
copy of a string that is already stored in memory.
The Less function compares pairs of elements for MSORTP. This routine
simply uses the StrComp function from Borland's STRINGS unit to
perform a case-sensitive comparison of two PChars. Note the typecast
from the untyped VAR parameters Less receives to the PChar data type
stored in the sort engine.
Finally, GetFromSortEngine retrieves the elements in sorted order. In
this program, GetElement return a pointer to the actual string data.
When the $X+ compiler directive is enabled, as it is here, the WriteLn
statement writes the string pointed to by each pointer P.
The main block of the program calls the MergeSort function to perform
the sort. First it must decide how to partition the heap between the
needs of string storage and the needs of MSORTP. MSORTP is given about
10% of the heap and the rest is reserved for text strings. Since the
sort engine is storing 4 byte elements, this is a balanced split if
the average string length is 40 bytes. Also note that the
MinimumHeapToUse function is called to guarantee that the sort engine
gets at least 4 times as much heap space as it requires at a minimum.
In this way merging is minimized and the sort will always succeed
unless there is insufficient disk space or insufficient heap space to
hold the input strings.
The second parameter passed to MergeSort is the size of the elements
to be sorted, in this case 4 bytes, the size of a PChar. The next
three parameters are the routines used to pass elements to the sort
engine, to compare elements, and to get sorted elements back from the
engine. The final parameter is a routine used to name each merge file.
In this program we use the default routine DefaultMergeName provided
by MSORTP, which stores the merge files in the current directory.
MergeSort returns a Word result indicating the status of the sort.
The result is zero to indicate success, or a non-zero result to
explain the reason for a failure.
TMSORTP.PAS, also in this archive, is a more extensive program for
testing the accuracy and speed of the MSORTP unit.
Interfaced Identifiers of MSORTP
---------------------------------------------------------------------
The following sections describe the constants, types, procedures, and
functions interfaced by the MSORTP unit.
Constants
=========
MaxSelectors = 256;
Maximum number of selectors allocated. MergeSort allocates memory
buffers using segments of up to 65535 bytes, but the size is always
an even power of two times the sort record length. As a result, the
default value of MaxSelectors allows MergeSort to directly access
8-16 megabytes of extended memory. If 8 megabytes is insufficient,
you may want to increase MaxSelectors to 512. If 8 megabytes is more
than enough, you may want to decrease MaxSelectors. The data segment
space MSORTP uses for selectors (2*MaxSelectors bytes) dominates the
data segment usage of the unit.
MedianThreshold = 16;
The partition length below which the in-memory quick sort simply
uses the middle element of the partition for the pivot element. For
partition lengths at least this size, MergeSort uses the median of
the left, middle, and right elements for the pivot. The
median-of-three algorithm significantly improves the speed of large
in-memory sorts for input that is already sorted, reverse sorted, or
nearly sorted.
MergeOrder = 5;
Input files used at a time during merge. You can set MergeOrder to
any value in the range from 2 to 10 inclusive. However,
experimentation indicates that the default value of 5 is optimal
under a wide range of conditions.
MinRecsPerRun = 4;
Minimum number of records that must fit in memory during a sort. If
fewer records fit in memory, MergeInfo and MergeSort return an error
code. If even MinRecsPerRun records fit in memory, MergeSort
performs merging to complete the sort. Increase this constant if you
prefer that the sort fail instead of doing an excessive amount of
merging.
SwapThreshold = 64;
The record size below which MergeSort swaps complete data records.
For records SwapThreshold bytes or larger, MergeSort swaps pointers
to records instead of the records themselves. Swapping pointers is
the faster approach for large records sorted in memory, but this
approach has a memory overhead of 4 bytes per record plus a buffer
segment that must be used for a run output buffer. The default of 64
was chosen to keep the typical overhead below 10%. Reducing the
default also provides no significant improvement in performance.
Types
=====
ElementIOProc = procedure;
Specifies the type of the routine passed as the SendToSortEngine
and GetFromSortEngine parameters to MergeSort. These routines must
be declared FAR and must have no parameters.
ElementCompareFunc = function (var X, Y) : Boolean;
Specifies the type of the routine passed as the Less parameter to
MergeSort. MergeSort calls this function to compare pairs of
elements as needed. It must be declared FAR and must have the form
shown here. It should return True if and only if element X is
strictly less than element Y. You should typecast the untyped
parameters to treat them as elements of the type you are sorting.
MergeNameFunc = function (Dest : PChar; MergeNum : Word) : PChar;
Specifies the type of the routine passed as the MergeName
parameter to MergeSort. MergeSort calls this function to obtain
the name of each merge file when needed. MSORTP interfaces a
default MergeNameFunc that can be used in many circumstances. This
routine, DefaultMergeName, returns a name of the form
'SORnnnnn.TMP', where nnnnn is the merge file number given by
MergeNum. A MergeNameFunc must return a unique file name for each
value of MergeNum. Typically you specify a non-default MergeNameFunc
if you want the merge files to be located on a different drive or
directory than the current one.
MergeInfoRec =
record
SortStatus : Word; {Predicted status of sort, assuming disk ok}
MergeFiles : Word; {Total number of merge files created}
MergeHandles : Word; {Maximum file handles used}
MergePhases : Word; {Number of merge phases}
MaxDiskSpace : LongInt; {Maximum peak disk space used}
HeapUsed : LongInt; {Heap space actually used}
SelectorCount: Word; {Number of selectors allocated}
RecsPerSel : Word; {Records stored in each selector}
end;
Describes the structure returned by the MergeInfo function. This
function predicts the status of a sort and its resource usage
given certain information about it. See MergeInfo for more
information.
Procedures and Functions
========================
Declaration
procedure AbortSort;
Purpose
Halts a sort prematurely.
Description
Call this routine from your Less, SendToSortEngine, or
GetFromSortEngine routines to abort a sort. The Less function must
always return False if it calls AbortSort. When you call
AbortSort, MergeSort always returns a status of 1.
See Also
MergeSort
Declaration
function DefaultMergeName(Dest : PChar; MergeNum : Word) : PChar;
Purpose
Returns a default name for each merge file.
Description
The default merge name is SORnnnnn.TMP, where nnnnn corresponds to
the MergeNum. For example, for MergeNum = 1, DefaultMergeName
returns 'SOR1.TMP'. For MergeNum = 999, DefaultMergeName returns
'SOR999.TMP'. For MergeNum = 65535 (the largest possible value),
DefaultMergeName returns 'SOR65535.TMP'.
DefaultMergeName stores the null-terminated string in the buffer
pointed to by Dest, and it returns Dest as the function result.
MergeSort provides an 80 character buffer each time it calls the
merge name function.
You can write your own merge name function that puts the merge files
somewhere besides the current directory. Follow the example of
DefaultMergeName and pass your function to MergeSort.
See Also
MergeSort
Declaration
function GetElement(var X) : Boolean;
Purpose
Returns the next element in sorted order.
Description
Call this routine repeatedly in your GetFromSortEngine routine to
retrieve the sorted elements. GetElement returns True until there
are no more sorted elements to retrieve. GetElement copies the next
element into the variable you pass as the parameter X. Be sure that
this variable is large enough to hold an entire record; otherwise
GetElement will overwrite memory. When GetElement returns False, the
parameter X is not initialized.
See Also
PutElement
Declaration
procedure MergeInfo(MaxHeapToUse : LongInt;
RecLen : Word;
NumRecs : LongInt;
var MI : MergeInfoRec);
Purpose
Predicts status and resource usage of a merge sort.
Description
MaxHeapToUse is the maximum number of bytes of heap space the sort
should use. MergeInfo actually allocates heap space up to this
amount; if there is less heap space available, the MergeInfo results
apply only to the available heap space.
RecLen is the size in bytes of each record to be sorted. NumRecs is
the total number of records to be sorted.
MI returns information about the sort. MI.SortStatus is zero if the
sort is predicted to succeed. MergeInfo assumes that there is
sufficient disk space and that no disk errors will occur.
MI.MergeFiles is the total number of merge files that will be
created.
MI.MergeHandles is the total number of file handles used. This will
always be in the range of 0 to MergeOrder+1 inclusive.
MI.MergePhases is the number of merge phases. A value of 0 indicates
that the sort can be done completely in memory. 1 indicates that
MergeOrder or fewer merge files are created and will be merged in
one pass. Higher values mean that multiple merge passes are
required, with the output from earlier passes feeding the input of
later passes.
MI.MaxDiskSpace is the peak disk space required. Since merge files
are deleted as soon as they are used, the disk space used in a merge
sort grows and shrinks. All merge files are deleted when the sort is
complete. MaxDiskSpace is always smaller than 2*RecLen*NumRecs. The
analysis that MergeInfo performs to determine MaxDiskSpace requires
that MI.MergeFiles be smaller than 16384, and that 4*MI.MergeFiles
bytes of heap space be free when MergeInfo is called. If these
requirements aren't met, MergeInfo returns -1 for MI.MaxDiskSpace.
MI.HeapUsed is the number of bytes of heap space the sort will
actually use. This is always less than or equal to MaxHeapToUse.
MI.SelectorCount is the number of segments of heap space the sort
will allocate.
MI.RecsPerSel is the number of records stored in each segment. This
is always a power of two.
See Also
MinimumHeapToUse OptimumHeapToUse
Declaration
function MergeSort(MaxHeapToUse : LongInt;
RecLen : Word;
SendToSortEngine : ElementIOProc;
Less : ElementCompareFunc;
GetFromSortEngine : ElementIOProc;
MergeName : MergeNameFunc) : Word;
Purpose
Sorts a group of elements.
Description
MaxHeapToUse specifies the maximum number of bytes of heap space the
sort will use. It is not an error for MaxHeapToUse to exceed
MemAvail; MergeSort will use whatever is available. If you know in
advance how many records will be sorted, it is a good idea to pass
the result returned by OptimumHeapToUse for this parameter.
RecLen is the number of bytes in each record to be sorted.
SendToSortEngine is a procedure you write that passes the sort
elements into the sort engine. Your procedure calls PutElement for
each element to be sorted. If PutElement returns False, an error has
occurred (usually out of disk space) and you should exit your
SendToSortEngine procedure.
Less is a function you write that compares pairs of elements. This
function must return True if and only if element "X" (the first
parameter) is strictly less than element "Y" (the second parameter).
GetFromSortEngine is a procedure you write that retrieves the sorted
elements from the sort engine. Your procedure calls GetElement to
retrieve each element in sorted order. When GetElement returns
False, all elements have been retrieved or an error has occurred.
MergeName is a function that provides a name for each merge file.
You can often pass DefaultMergeName for this parameter.
DefaultMergeName puts all of the merge files in the current
directory at the time of the merge sort.
MergeSort returns a status code in its function result. It can
return the following values:
0 success
1 user abort (AbortSort was called)
8 insufficient memory to sort
106 invalid input parameter
(RecLen zero, MaxHeapToUse too small)
204 invalid pointer returned by GlobalLock, or
SelectorInc <> 8
213 no elements available to sort
214 more than 65535 merge files
else DOS or Turbo Pascal error code
The most common Turbo Pascal error code returned by MergeSort is
101, which means that there was insufficient disk space to store the
merge files.
If you want to predict whether a sort can succeed before actually
attempting it, use the MergeInfo procedure.
See Also
DefaultMergeName GetElement MergeInfo PutElement
Declaration
function MinimumHeapToUse(RecLen : Word) : LongInt;
Purpose
Returns the minimum heap space that allows MergeSort to succeed.
Description
Given the size of each record (RecLen), MinimumHeapToUse returns the
smallest amount of heap space that will allow a sort to succeed. You
can pass this value to MergeSort to sort a group of elements using
the smallest amount of memory. Note that the value returned by
MinimumHeapToUse is often very small and can cause a significant
amount of merging, so it's generally better to multiply the result
by a reasonable factor (say 2-4) even if you want to minimize heap
usage of a sort.
Because of a quirk in the BP7 DPMI heap manager, MinimumHeapToUse
adds 2048 bytes to the actual computed heap requirement. This is
important because sometimes the DPMI heap manager consumes about
2000 bytes of heap space for its internal use.
See Also
OptimumHeapToUse
Declaration
function OptimumHeapToUse(RecLen : Word;
NumRecs : LongInt) : LongInt;
Purpose
Returns the smallest heap space for a sort with no merging.
Description
Given the size of each record (RecLen) and the number of records to
be sorted (NumRecs), OptimumHeapToUse returns the amount of heap
space needed to perform the sort entirely in memory. Additional heap
space does not help the sort. Less heap space causes merging.
Because of a quirk in the BP7 DPMI heap manager, OptimumHeapToUse
adds 2048 bytes to the actual computed heap requirement. This is
important because sometimes the DPMI heap manager consumes about
2000 bytes of heap space for its internal use.
See Also
MinimumHeapToUse
Declaration
function PutElement(var X) : Boolean;
Purpose
Submits an element to the sort system.
Description
Call this function within your SendToSortEngine routine for each
element to be sorted. Pass the element to be sorted in the untyped
parameter X. PutElement returns True if the element is successfully
processed by MergeSort. It returns False if an error occurred; do
not continue to call PutElement in this case.
See Also
GetElement
How MSORTP Works
---------------------------------------------------------------------
The MergeSort function allocates a group of identical segments by
calling GlobalAlloc repeatedly. (When MSORTP is compiled for DOS real
mode, it provides a substitute routine for GlobalAlloc that is just a
shell around the normal Turbo Pascal GetMem routine.) Each segment
holds an even power-of-two number of records. The segments
collectively use no more than MaxHeapToUse bytes. The last record
element of the group of segments is used to store a pivot element. The
next to last record element is usually used for an element swap buffer
(see below). The remaining elements are collectively called the run
buffer and are used for an in-memory sort of each "run" of elements.
The "run" must number at least MinRecsPerRun or the sort will not
proceed. MergeSort also chooses the segment size so that there are at
least MergeOrder+1 segments, since these segments are reused as
buffers later during the merge process.
To avoid the use of complicated code for writing and reading buffers
and an excessive number of LongInt operations, each segment must be
smaller than 65536 bytes. Subject to the restrictions already
mentioned, however, the segment size is maximized. Hence, typical
segment sizes are 32768-65535 bytes. If the segment size is 32768, the
maximum memory MergeSort can access is 8MB for the default value of
MaxSelectors (256). If this situation adversely effects the
performance of a sort, MaxSelectors can be increased to 512 to provide
access to the full 16MB of RAM that the DPMI manager can use. The cost
is 512 additional bytes of data segment space. Conversely, if a sort
will never need to access 8MB of memory, MaxSelectors can be reduced
and data segment usage will decrease proportionately.
Elements are obtained when SendToSortEngine calls PutElement, which
adds elements to the run buffer. When the run buffer fills, or when
there are no more elements, the run buffer is sorted in memory using a
non-recursive quicksort which calls the Less function to compare
elements. If all elements fit into a single run buffer, no merging is
required and GetFromSortEngine retrieves the elements directly from
the sorted run buffer by calling GetElement.
If there are more elements than fit in one run buffer, the sorted run
buffer is written to a merge file assigned a sequential number. Merge
files are created until all elements have been obtained and all run
buffers sorted.
When all merge files have been created, the merge process starts. The
first MergeOrder segments of memory are reused as input buffers for
reading from up to MergeOrder input files opened simultaneously. The
next segment is used as an output buffer for storing the merged
result. Any other allocated segments are not used during merging. Up
to MergeOrder merge files are opened for input and a bufferful is read
from each merge file into its buffer. The first unused element in each
buffer is compared and the smallest is sent to the output. When the
elements in an input file are exhausted, the file is closed and
deleted. When all input files are gone, the output file is closed. The
output file then becomes part of the input merge sequence for the next
merge phase.
When fewer than MergeOrder input files remain, each merged output
element is passed directly back to the caller via the GetElement
routine instead of writing it to another merge file. In this way,
MergeSort avoids a final pass of writing to disk and reading back the
records.
Due to a quirk of the BP7 DPMI heap manager, MergeInfo or MergeSort
may appear to leak heap space. Both routines deallocate all of the
segments that they initially allocate. The quirk is that, after the
DPMI heap manager makes a small number of allocations, it allocates
about 2000 bytes of memory for itself and does not release this memory
even if all of the user allocations are later released. Hence,
MergeInfo may return a value of HeapUsed about 2000 bytes larger than
needed, and MemAvail may be 2000 bytes smaller after you call
MergeInfo or MergeSort. OptimumHeapToUse and MinimumHeapToUse also
return a value 2048 bytes bigger than may be needed because of this
quirk.
When MergeSort performs an in-memory quick sort of a run, the size of
the data record determines whether MergeSort swaps pointers to records
or the complete records. When record size is greater than or equal to
SwapThreshold (default 64), MergeSort swaps pointers. Otherwise, it
swaps entire records. When it swaps pointers, MergeSort must allocate
an additional 4 bytes of space for each record to hold the swap
pointer. It must also reserve one segment for writing the sorted run
to a merge file. Given the default SwapThreshold value of 64, the
memory overhead of swapping pointers is typically less than 10%.
Swapping pointers gives large gains in sort speed (2x or more) for
large records that can be sorted completely in memory.
Version History
---------------------------------------------------------------------
Version 5.40 1/93
Initial release. Version number synchronized with that of B-Tree
Filer.
Copyright and License Information
---------------------------------------------------------------------
MSORTP is copyright (c) 1993 by TurboPower Software. All Rights
Reserved.
Although the MSORTP files are copyrighted, you may use and distribute
them freely as long as you do not sell them or include them with other
software that you sell.
The latest version of MSORTP is always available in LIB 6 of the
PCVENB forum of CompuServe. It's also available on the TurboPower BBS
at 719-260-9726. Look for MSORTP.ZIP.
MSORTP is an excerpt from TurboPower Software's product B-Tree Filer.
B-Tree Filer is a collection of routines for managing single user and
network databases, as well as for accessing common network functions
on Novell and NetBIOS networks. Download FILER.BRO from LIB 6 of
PCVENB for more information, or give TurboPower a call.
You can reach TurboPower at:
TurboPower Software
P.O. Box 49009
Colorado Springs, CO 80949-9009
719-260-6641 (voice, Monday-Friday 9AM-5PM)
719-260-7151 (fax)
719-260-9136 (sales)
719-260-9726 (BBS)
Compuserve: 76004,2611